Data-Driven Predetermination of Cu Oxidation State in Copper Nanoparticles: Application to the Synthesis by Laser Ablation in Liquid

Copper-based nanocrystals are reference nanomaterials for integration into emerging green technologies, with laser ablation in liquid (LAL) being a remarkable technique for their synthesis. However, the achievement of a specific type of nanocrystal, among the whole library of nanomaterials available using LAL, has been until now an empirical endeavor based on changing synthesis parameters and characterizing the products. Here, we started from the bibliographic analysis of LAL synthesis of Cu-based nanocrystals to identify the relevant physical and chemical features for the predetermination of copper oxidation state. First, single features and their combinations were screened by linear regression analysis, also using a genetic algorithm, to find the best correlation with experimental output and identify the equation giving the best prediction of the LAL results. Then, machine learning (ML) models were exploited to unravel cross-correlations between features that are hidden in the linear regression analysis. Although the LAL-generated Cu nanocrystals may be present in a range of oxidation states, from metallic copper to cuprous oxide (Cu2O) and cupric oxide (CuO), in addition to the formation of other materials such as Cu2S and CuCN, ML was able to guide the experiments toward the maximization of the compounds in the greatest demand for integration in sustainable processes. This approach is of general applicability to other nanomaterials and can help understand the origin of the chemical pathways of nanocrystals generated by LAL, providing a rational guideline for the conscious predetermination of laser-synthesis parameters toward the desired compounds.


In the file "Supporting_Information_Tables S3-7.xlsx":
-Table S3: Results with Artificial Neural Networks
-Table S4: Ranking of features from four ML models
-Table S5: Hyperparameters of the best ML models using 9 features
-Table S6: Hyperparameters of the best ML models using 10 features
-Table S7: Hyperparameters of the best ML models using 11 features
GA-feature-subset-selection

S1. Single-parameter linear regression analysis (features and super-features)
For the single-parameter linear regression analysis of (feature, output) couples, the dataset was shifted to (feature', output') couples to permit the use of a Log-Log scale (decimal log), and the linear regression was applied to this shifted Log-Log dataset. The procedure for adapting (feature, output) or (super-feature, output) data to the Log-Log plot, so that the linear regression of the dataset can be performed, is sketched in Figure S1. A Log-Log plot allows the linear regression to be applied even when there is no prior information on whether the correlation in the dataset is linear or nonlinear, because any power-law relation becomes a straight line on this scale. However, some data have negative or zero values (Figure S1A), which cannot be converted directly to a Log-Log scale. To avoid excluding data ≤ 0 from the Log-Log plot, each list was rigidly shifted so that its minimum became > 0 (Figure S1B). The procedure worked in two steps: (i) each list was shifted so that its minimum became 0; (ii) each list was shifted again so that its minimum became 1/20 of the penultimate value, leading to the final (feature', output') or (super-feature', output') datasets. Finally, these shifted but complete datasets were used for the linear regression (Figure S1C).

Figure S1. Sketch of the procedure for adaptation of (feature, output) or (super-feature, output) data to the Log-Log plot (decimal log) before the linear regression of the dataset. Some data have negative or zero values (A), which cannot be converted directly to a Log-Log scale. To avoid excluding data ≤ 0 from the Log-Log plot, each list was rigidly shifted so that its minimum became > 0 (B). The procedure worked in two steps: (i) each list was shifted so that its minimum became 0; (ii) each list was shifted again so that its minimum became 1/20 of the penultimate value, leading to the final (feature', output') or (super-feature', output') datasets. Finally, these shifted but complete datasets were used for the linear regression (C).
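The two-step shift and the Log-Log fit can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the study. The function names are hypothetical, and step (ii) rests on an assumption: the "penultimate value" is read here as the smallest non-zero value left after step (i), and the offset is 1/20 of it.

```python
import numpy as np


def shift_positive(values, fraction=1 / 20):
    """Rigidly shift a list so that its minimum becomes strictly positive."""
    v = np.asarray(values, dtype=float)
    shifted = v - v.min()  # step (i): the minimum becomes 0
    positive = shifted[shifted > 0]
    if positive.size == 0:
        raise ValueError("all values are identical; nothing to fit")
    # step (ii), ASSUMPTION: offset = fraction * smallest non-zero shifted value
    return shifted + fraction * positive.min()


def linear_fit(x, y):
    """Least-squares line; returns slope, R^2 and relative s.e. (%) on the slope."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    sxx = np.sum(dx ** 2)
    slope = np.sum(dx * dy) / sxx
    ss_res = np.sum((dy - slope * dx) ** 2)  # residual sum of squares
    r2 = 1.0 - ss_res / np.sum(dy ** 2)
    se_slope = np.sqrt(ss_res / (x.size - 2) / sxx)
    return slope, r2, 100.0 * se_slope / abs(slope)


def loglog_fit(feature, output):
    """Shift both lists, take the decimal log, and fit a straight line."""
    return linear_fit(np.log10(shift_positive(feature)),
                      np.log10(shift_positive(output)))
```

Calling `loglog_fit(feature, output)` on two equally long lists returns the slope, the R² and the relative standard error on the slope, which are the quantities reported in Figure S2A-B.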
From the linear regression of the (feature', output') datasets on a Log-Log scale, easily understandable parameters such as the R² and the standard error (s.e.) on the slope of the linear fit were identified (Figure S2A-B). The Log-Log plot for P12 (Figure S2C) shows a dataset arranged in three sub-groups (dashed red lines), which cannot be described by a single linear trend. If, on the one hand, the largest R² indicates the dominant role of the % of O+Cl+CN+S in solvent molecules for the oxidation state of Cu, on the other hand the grouping of the (P12', output') dataset suggests that cross-correlations with other features are indispensable to determine the output. It should also be noted that some synthesis parameters are much more frequent than others in the literature. This leads to an inhomogeneous distribution of points in the feature space, with many points accumulating at the most frequent feature values, which strongly affects the results of the linear regression. Hence, the datasets of the average output (<output>) for each feature were also calculated and used for the linear regression. This average is obtained considering that, for each feature Pi in our database, there is a number N(Pi = p) of entries Pi,j with the same value p, and the corresponding <output> is obtained by dividing the sum of the outputs of these entries by N(Pi = p):

$$\langle \mathrm{output} \rangle (P_i = p) = \frac{1}{N(P_i = p)} \sum_{j:\, P_{i,j} = p} \mathrm{output}_j \;\rightarrow\; \mathrm{Log}(\langle \mathrm{output} \rangle') \qquad \text{(eq. S1)}$$

Figure S3. Sketch of the procedure for the calculation of (feature, <output>) couples, which are subsequently transformed into (feature', <output>') couples as described in Figure S1 and fitted with a linear regression, with the results reported in Figure 2D-E-F.
Then, the linear regression was performed on the (Log[feature'], Log[<output>']) couples obtained as described in Figure S1. In this way, the linear regression equally weighted all feature values, independent of the frequency of their use in the literature. The results show a net increment of the R² for several features (Figure S2D) but also an increment of the relative s.e. (Figure S2E) because of the lower number of data points in the linear fit. The features with the highest R² and the lowest relative s.e. are P11: # of atoms of solvent molecules (0.6394, 25 %), P35: Minimum ionization potential of the solute (0.5115, 27 %), P23: Density of solvent (0.4002, 35 %) and P12: % of O+Cl+CN+S of solvent (0.3898, 38 %). By crossing the results with those of Figure S2A-B, we confirmed that P11, P12 and P23 are the features with the strongest correlation with the oxidation state of the Cu NPs.
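The averaging of the output over entries that share the same feature value can be illustrated as follows. This is a sketch with a hypothetical function name and made-up numbers. It only shows how repeated feature values collapse into a single (feature, <output>) couple before the shift and the Log-Log fit.

```python
import numpy as np


def average_output_per_value(feature, output):
    """Return the distinct feature values and the mean output for each of them."""
    feature = np.asarray(feature, dtype=float)
    output = np.asarray(output, dtype=float)
    # 'inverse' maps every entry to the index of its distinct feature value,
    # 'counts' is the number of entries sharing that value.
    values, inverse, counts = np.unique(
        feature, return_inverse=True, return_counts=True
    )
    sums = np.bincount(inverse.ravel(), weights=output)
    return values, sums / counts


# Illustrative numbers only: the value 1.0 appears three times, 2.0 once.
values, mean_output = average_output_per_value(
    [1.0, 1.0, 2.0, 1.0], [0.0, 2.0, 1.0, 1.0]
)
```

After this step every distinct feature value contributes one point to the fit, regardless of how often it occurs in the database.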

The regression analysis on single features does not tell whether a combination of features is correlated with the output due to a synergistic effect. For instance, literature analysis indicated that the oxidation state of Cu in NPs obtained in organic solvents may be lower when an inert gas atmosphere is used instead of ambient air, which contains oxygen [1], meaning that the combination of solvent and gas atmosphere features is correlated with the average oxidation state.
Hence, the super-features SPj were generated according to

$$SP_j = \prod_{i=1}^{N_P} P_i^{a_{ij}} \qquad \text{(eq. S2)}$$

where NP is the total number of single features Pi considered and aij is their exponent, taken from the combination

$$(a_{1j}, a_{2j}, \ldots, a_{N_P j}), \quad a_{ij} \in \{-1, 0, 1\} \qquad \text{(eq. S3)}$$

with j the index identifying the given combination among all those possible for the NP features with one among three possible exponents (numerator: 1, denominator: -1, absent: 0). The number of combinations scales as $3^{N_P}$ and is of the order of 1.5 × 10^17 for NP = 36. This was too much for the available computational capabilities. Hence, the features were initially divided into four groups (G1 to G4), the last of which collects the properties of the prevailing solute (G4, NP = 8). These groups were screened for the SPj with the best correlation under a linear regression, leading to the identification of other subsets of features labelled SP_C (all features with exponents different from 0 in the super-features with the highest R² obtained from SP_G1, SP_G2, SP_G3 and SP_G4) and SP_D (all the features with average exponent different from 0 in the combinations with the 100 highest R² obtained from SP_G1, SP_G2, SP_G3 and SP_G4). Moreover, the sixteen features with the highest R² in the histograms of Figure S2A and Figure S2D were also considered, labelled respectively SP_A and SP_B. Other combinations (SP_E, SP_F, SP_G, SP_H, SP_I) were also assessed, taking the features with exponents systematically different from 0 in the super-feature with the highest R² obtained from SP_A, SP_B, SP_C and SP_D.

The results (see Figure S4 and Tables S1-2) indicate an increment of R² when passing from single features (Pi) to their combinations SPj. SP_E, SP_F, SP_H and SP_I all have an R² slightly larger than 0.3 and s.e. < 10 %. We hypothesize that errors in the literature data, due to experimental variables that cannot be accounted for in the database, such as NPs ageing, target ageing and inaccurate assessment of Cu NPs composition, contribute to the low R² by lowering the attainable correlation between the output and the super-features.
According to the linear regression analysis with the super-features, the best result was obtained with SP_F (eq. S5, reported below the plot in Figure S4C).
However, the R² is low and the dataset for SP_F is still divided into three groups, as observed for P12 (dashed red lines in Figure S4C). This suggests the absence of a physical or chemical reason for expecting a simple linear correlation between the Log of features or super-features and the oxidation state of the Cu NPs. Instead, Figure S4C indicates that the relationship is much more complex and that the cross-correlations between features and Cu oxidation state are not completely accessible with the linear regression approach.

S2. Comparison of the ML performances with the top 9, 10 and 11 features
The ML models were also optimised by including the 10th and 11th features in the ranking of Table S4, which belong to G3, the group of the physical properties of the solvent (P18: Refractive index at 589 nm, P24: Boiling point). [...] [3][4][5][6] Therefore, increasing the number of features beyond 9 is unnecessary and adds no insight into the synthesis process.

Figure S2. R² (A) and s.e. (%) on the slope (B) for the linear regression of the single-feature analysis. (C) Log-Log plot for the feature P12, which has the best correlation with the Cu oxidation state. The dashed lines identify three subsets of data which are not fitted simultaneously by the model. (D-E) R² (D) and s.e. (%) on the slope (E) for the linear regression of the single-feature analysis versus the average of the Cu oxidation state for each feature value (<output>). (F) Log-Log plot for the three features (P11, P35, P23) with the best correlation with the average Cu oxidation state for each feature value.

Figure S4. R² (A) and s.e. (%) on the slope (B) for the linear regression of the super-features analysis. (C) Log-Log plot for the super-feature SP_F, also shown in the figure, which has the best correlation with the oxidation state of Cu, as described by the equation below the plot. The dashed lines identify three subsets of data which are not fitted simultaneously by the model.
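The super-feature screening described in Section S1 (products of features with exponents 1, -1 or 0) can be sketched for a small group of features as below. This is an illustrative brute-force search under stated assumptions, not the code used in the study. The function name is hypothetical, the input is assumed to be already shifted to strictly positive values, and R² is computed as the squared Pearson correlation, which equals the R² of a simple linear fit.

```python
import itertools

import numpy as np


def best_super_feature(features, output):
    """Exhaustively screen exponent combinations in {-1, 0, 1}.

    features: array (n_samples, n_features), strictly positive (assumed shifted).
    output:   array (n_samples,), strictly positive (assumed shifted).
    Returns (best R^2, tuple of exponents) for the Log-Log linear fit.
    """
    log_f = np.log10(np.asarray(features, dtype=float))
    log_y = np.log10(np.asarray(output, dtype=float))
    best_r2, best_exps = -1.0, None
    for exps in itertools.product((-1, 0, 1), repeat=log_f.shape[1]):
        if not any(exps):
            continue  # all exponents 0: no feature is used
        # Log of the product of P_i ** a_ij is the weighted sum of the logs.
        log_sp = log_f @ np.array(exps, dtype=float)
        if np.std(log_sp) < 1e-12:
            continue  # constant super-feature, correlation undefined
        r2 = np.corrcoef(log_sp, log_y)[0, 1] ** 2
        if r2 > best_r2:
            best_r2, best_exps = r2, exps
    return best_r2, best_exps


# The full search space for 36 features is 3 ** 36 combinations (about
# 1.5e17), which is why the screening is restricted to small groups.
n_combinations_36 = 3 ** 36
```

For a group of 8 features the loop visits 3^8 = 6561 combinations, which is trivial, whereas the 36-feature space is out of reach for exhaustive enumeration.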

Figure S5. (A-C) R² of the linear fit of the predicted values versus real values for the training and test datasets taken from the literature, using the best 9, 10 or 11 features resulting from the ranking in Table S4. Red bars report the R² of the linear fit for the predicted values versus the values obtained from the experiments in this study. MAE (D-F) and RMSE (G-I) are also reported, showing that the models maintain more stable or better performance with 9 features.
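The three metrics compared in Figure S5 can be computed as in the sketch below. The function name and the numbers are illustrative only. R² is taken here as the R² of the linear fit of predicted versus real values, that is, the squared Pearson correlation between the two lists.

```python
import numpy as np


def regression_metrics(y_true, y_pred):
    """R^2 of the predicted-vs-real linear fit, MAE and RMSE."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    errors = y_pred - y_true
    return {
        "r2_fit": np.corrcoef(y_true, y_pred)[0, 1] ** 2,
        "mae": np.mean(np.abs(errors)),         # mean absolute error
        "rmse": np.sqrt(np.mean(errors ** 2)),  # root mean squared error
    }


# Made-up oxidation states, for illustration only.
metrics = regression_metrics([0.0, 1.0, 2.0], [0.0, 1.0, 3.0])
```

Note that the R² of the predicted-versus-real fit measures how well the predictions align on a straight line, so it is insensitive to a constant offset or scale error, whereas MAE and RMSE are not. This is one reason to report all three.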

Figure S7. Prediction of the variation of the oxidation state as a function of the % of O+Cl+CN+S in solvent molecules (P12) and solute molecules (P31) at two solute concentrations (P36 = 0.001 and 0.1) and two different numbers of atoms in solvent molecules (P11: 3 and 12), and as a function of the number of atoms (P11) and of the % of O+Cl+CN+S (P12) in solvent molecules for a % of O+Cl+CN+S in solute (P31) of 100 and two solute concentrations (P36 = 0.1 and 0.001). Set-up parameters are fixed as reported in the tables above each graph, considering typical experimental conditions for kHz LAL with ns (A) or ps (B) pulses and MHz LAL (C). All predictions are obtained with the best model (Voting Regressor). The general trend of the predictions is similar to the results in Figure 6 of the main article, where ns laser pulses with a repetition rate of 50 Hz and energy in the range of mJ/pulse are adopted. However, the predictions for a higher repetition rate and lower pulse energy indicate a slight increase of the oxidation state, which can be reduced by shortening the pulse duration from ns to ps.
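A prediction map of this kind can be produced by fitting a voting ensemble and evaluating it on a grid of two features while the others are held fixed. The sketch below uses scikit-learn's `VotingRegressor` with two base learners chosen arbitrarily for illustration. The base models, their hyperparameters and the training data are all assumptions: the data are synthetic and are not the literature database, and the actual model settings are those listed in Tables S5-S7.

```python
import numpy as np
from sklearn.ensemble import (
    GradientBoostingRegressor,
    RandomForestRegressor,
    VotingRegressor,
)

rng = np.random.default_rng(0)
n = 200
# SYNTHETIC stand-in table; the columns play the role of P11, P12, P31, P36.
X = np.column_stack([
    rng.integers(3, 13, n),        # P11: number of atoms in solvent molecule
    rng.uniform(0, 100, n),        # P12: % of O+Cl+CN+S in solvent
    rng.uniform(0, 100, n),        # P31: % of O+Cl+CN+S in solute
    rng.choice([0.001, 0.1], n),   # P36: solute concentration
])
# SYNTHETIC target loosely shaped like an oxidation state between 0 and 2.
y = 2.0 * X[:, 1] / 100.0 + rng.normal(0.0, 0.05, n)

# Unweighted voting: the prediction is the mean of the base predictions.
model = VotingRegressor(estimators=[
    ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ("gbr", GradientBoostingRegressor(random_state=0)),
])
model.fit(X, y)

# Grid over P12 and P31 with P11 and P36 held fixed (P11 = 3, P36 = 0.001).
p12, p31 = np.meshgrid(np.linspace(0, 100, 11), np.linspace(0, 100, 11))
grid = np.column_stack([
    np.full(p12.size, 3.0), p12.ravel(), p31.ravel(), np.full(p12.size, 0.001)
])
predicted_map = model.predict(grid).reshape(p12.shape)
```

Repeating the grid evaluation with different fixed values (for example P11 = 12 or P36 = 0.1) gives the other panels of such a figure.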

Table S1. Summary of super-features (SPj) used for the linear regression analysis. For each feature included in the SPj, the exponent (aij) which maximised the R² of the linear regression of the (SPj', output') datasets is reported.

Table S2. Summary of linear regression analysis for the various super-features tested.

Table S4. Ranking of features from four ML models (in the file "Supporting_Information_Tables S3-7.xlsx").